Master Thesis Project
Unveiling Hidden Information in Unstructured Documents: Organization and Hybrid Retrieval with Knowledge Graphs.
Goal
We generalize the informational content of a document along three distinct dimensions:
- Content: The textual content of the document.
- Structure: The hierarchical organization of the document, including headings, paragraphs, and sections.
- Entities: The entities mentioned within the documents and their semantic relations.
Given this categorization, our objective is to design a retrieval system, ranging from document structuring to the retrieval model itself, that can determine the most relevant chunks by integrating and combining all these informational dimensions. Specifically, the main objectives of this thesis are the following:
- Document Structuring: Developing a Knowledge Graph (KG) representation that effectively models a collection of unstructured documents, focusing on retaining and making explicit all of their informational dimensions: both the textual content and the more implicit metadata, such as the organization of passages and the relationships between entities.
- Hybrid Retrieval: Designing a retrieval system capable of leveraging this enriched structure, and therefore taking full advantage of all three informational dimensions to retrieve the most relevant chunks. This includes investigating which retrieval methods are most critical in identifying the relevant passages, and consequently designing the system to weight those methods more heavily.
- Comparison with traditional RAG: Understanding whether structuring and enriching documents offers a tangible benefit in RAG systems.
Methodology
The work proceeds in two stages:
- Knowledge-Graph Structuring. A custom parser converts 5k+ Wikipedia pages into a Neo4j graph that simultaneously stores:
  - Chunk nodes for the textual content
  - SubChapter nodes that preserve the document hierarchy and structure
  - Entity nodes linked by semantic relations harvested from DBpedia
- Hybrid Retrieval Engine. Six complementary retrievers, covering page structure, entity sub-graphs, and dense-vector similarity, score each chunk from different viewpoints. A lightweight neural network learns to fuse these scores, weighting each retriever according to its real contribution to relevance rather than treating them equally.
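The score-fusion idea in the second stage can be illustrated with a minimal sketch. The source does not specify the network architecture, so this example assumes the simplest possible case: a single logistic unit trained with stochastic gradient descent over six retriever scores per chunk. The function names (`train_fusion`, `fuse`, `rank_chunks`), the retriever ordering, and the synthetic training data are all hypothetical and exist only for illustration; they are not the thesis implementation.

```python
import math
import random

N_RETRIEVERS = 6  # the source names six retrievers; their order here is arbitrary


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fuse(weights, bias, scores):
    """Fused relevance in (0, 1) for one chunk's list of retriever scores."""
    return sigmoid(sum(w * s for w, s in zip(weights, scores)) + bias)


def train_fusion(samples, epochs=50, lr=0.5, seed=0):
    """Fit one weight per retriever with SGD on the logistic loss.

    samples: list of (scores, label) pairs, label 1.0 = relevant chunk.
    """
    rng = random.Random(seed)
    samples = list(samples)  # copy so the caller's list is not shuffled
    weights = [0.0] * N_RETRIEVERS
    bias = 0.0
    for _ in range(epochs):
        rng.shuffle(samples)
        for scores, label in samples:
            err = fuse(weights, bias, scores) - label  # gradient w.r.t. the logit
            for i in range(N_RETRIEVERS):
                weights[i] -= lr * err * scores[i]
            bias -= lr * err
    return weights, bias


def make_toy_samples(n=400, seed=1):
    """Synthetic data (an assumption): relevance depends mostly on retriever 0,
    slightly on retriever 3, and not at all on the other four."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        scores = [rng.random() for _ in range(N_RETRIEVERS)]
        label = 1.0 if 0.8 * scores[0] + 0.2 * scores[3] > 0.5 else 0.0
        samples.append((scores, label))
    return samples


def rank_chunks(weights, bias, chunk_scores):
    """Return chunk ids sorted by fused score, best first."""
    return sorted(
        chunk_scores,
        key=lambda cid: fuse(weights, bias, chunk_scores[cid]),
        reverse=True,
    )


if __name__ == "__main__":
    weights, bias = train_fusion(make_toy_samples())
    print("learned weights:", [round(w, 2) for w in weights])
    candidates = {
        "chunk_a": [0.9, 0.1, 0.2, 0.5, 0.2, 0.2],  # strong on the informative retriever
        "chunk_b": [0.1, 0.9, 0.2, 0.5, 0.2, 0.2],  # strong on an uninformative one
    }
    print("ranking:", rank_chunks(weights, bias, candidates))
```

On this toy data the informative retriever ends up with the largest weight, which is the behaviour the paragraph describes: retrievers are weighted by their contribution to relevance rather than equally. A real fusion network would likely have hidden layers and be trained on labelled query-chunk pairs.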
Results
On the held-out split of Google NQ and on a 100-question synthetic Multi-Hop set, the hybrid approach:
- surpasses the best single retriever and a “naive” graph + vector baseline on all metrics.
- delivers the highest F-scores (F1-3) for both single-hop and multi-hop queries.
- yields noticeably better RAG answers when assessed with ROUGE, cosine similarity, and RAGAS LLM judgments, without retraining for multi-hop tasks.
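The source does not define "F1-3". One plausible reading, assumed here, is the F-beta score with beta = 1, 2, 3, where larger beta values weight recall more heavily than precision. The sketch below shows how such scores could be computed for a single query from the retrieved and relevant chunk ids; the function names and the example ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Set-based precision and recall for one query."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def f_beta(precision, recall, beta=1.0):
    """F-beta score; beta > 1 favours recall, beta = 1 is the usual F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


if __name__ == "__main__":
    # Made-up chunk ids for illustration only.
    retrieved = ["c1", "c2", "c3", "c4"]
    relevant = ["c1", "c2", "c5"]
    p, r = precision_recall(retrieved, relevant)
    for beta in (1, 2, 3):
        print(f"F{beta} = {f_beta(p, r, beta):.3f}")
```

Under this reading, a retriever that returns more of the relevant chunks at the cost of some extra irrelevant ones is penalised less by F2 and F3 than by F1.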